Method and switching fabric for counteracting a saturation tree occurring in a network with nodes

ABSTRACT

Method and switching fabric for counteracting a saturation tree occurring in a network with nodes. An example of a method comprises the steps of: generating, at a local node where a congestion emerges, a first congestion information; sending the first congestion information to at least one upstream node; responsive to one received first congestion information, comparing the content of the received first congestion information with a present local status based on a set of predefined rules in order to identify at least one packet stream causing the congestion, and generating a second congestion information comprising the identified at least one packet stream causing the congestion; and sending the second congestion information to at least one further upstream node from where the identified at least one packet stream was received.

TECHNICAL FIELD

The present invention relates to a method for counteracting saturation trees occurring in switch-based information networks.

BACKGROUND OF THE INVENTION

Packet blocking and flow interference in packet-switched interconnects lead to congestion and saturation trees that could cause performance collapse. Non-interfering architectures with independently flowing data flows were practically approximated by static, a priori definition and reservation of end-to-end resources, e.g. links, virtual channels/lanes, buffers, queues, that are allocated to the data flows. Such approaches are effective, although heavy in overhead and limited in scalability.

Other approaches, such as the asynchronous transfer mode (ATM) and IP, can prevent saturation trees by sacrificing losslessness. The general method to attain a scalable and stable network architecture as used in TCP/IP and ATM networks builds on end-to-end flow control, window- or rate-based, respectively. The main drawback here is convergence speed because of long delays. Whereas a reaction time of milliseconds is adequate for large/slow networks, server and storage interconnection networks require microsecond solutions or faster to prevent saturation trees and catastrophic performance degradation. Thus, this method is more appropriate for long-lived (static) congestion than for short-lived (dynamic) congestion management. In such an environment, congestion leads to excessive loss (drop) rates.

In non-provisioned interconnection networks (SAN, StAN, HPC etc.), congestion control is considered one of the difficult challenges. Non-interfering architectures are described by G. F. Pfister and V. A. Norton, “Hot Spot Contention and Combining in Multistage Interconnection Networks”, IEEE Trans. on Computers, Vol. C-34, No. 10, October 1985, pp. 933-938; or by W. Dally, “Virtual-Channel Flow Control”, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, March 1992, pp. 194-205.

Dynamic non-interference via reactive flow and congestion control remains an open issue of increased interest for supercomputer, server and storage interconnection networks. Reactive flow and congestion control is a hard space-time problem, because an average network with (tens of) thousands of nodes should resolve contention between many flows sharing the interconnection network's resources. The issue is how to disseminate accurate and timely status information to all traffic participants, i.e. a large address space identifying flows and their resource allocations should be communicated with low latency, globally, or if possible on a need-to-know basis.

United States Patent U.S. Pat. No. 5,768,258 describes a selective congestion control mechanism for information networks to mitigate the loss rate. The congestion control mechanism is especially used for ATM networks supporting data services or other non-reserved bandwidth traffic. The control mechanism reacts upon detection of a traffic bottleneck by selectively and temporarily holding back the data traffic that is to travel via the bottleneck. A congested node transmits congestion notifications containing one routing label information per flow and deferment information to upstream nodes, thus enabling a selective temporary backpressure action. For detecting a congestion, the buffer occupancy of an output port of a node is monitored, and if the occupancy exceeds a given threshold, congestion is detected. A communication and switch-based ATM network is connection-oriented, and all ATM cells belonging to a connection follow the same path by swapping the routing labels at the input port of each switch. Thus, the actual routing decisions take place only during connection set-up, and routing is not considered a critical issue in the ATM environment. Upstream switching nodes are informed on a hop-by-hop basis about the traffic flows that should be back-pressured to attenuate the congestion. The congestion notification comprises the information that selected cells that flow via the bottleneck link have to be held back for a duration of time. In fact, this induces saturation trees.

In the known congestion controlling methods, a tree of upstream nodes is blocked if a congestion globalizes. There is no differentiation between data packets that cause the congestion (culprits) and data packets that are only victims of the congestion if “culprits” and “victims” share the same buffer. With VPI/VCI labelling, only one label can be used per flow, i.e. the selectivity is fixed.

In view of the prior art, it is a general object of this invention to provide a method to dynamically counteract saturation trees in a lossless packet-switched multistage interconnection network. It is a further object of the invention to rapidly attenuate dynamic congestion in interconnection networks (SAN, clusters, supercomputers) by providing on-demand resource non-interference. It is a further object of the invention to provide a scheme that counteracts saturation trees, prevents buffer overflows and underflows, and enables more efficient use of the switching capacity of a switching network. Whereas the prior art also performs a selective form of backpressure with fixed granularity, it is an object of the invention to adapt the granularity of the selection to reduce the congestion signalling overhead. Efficiency is better with variable granularity.

SUMMARY OF THE INVENTION

The present invention provides a selective congestion control mechanism that provides dynamic reactive congestion control that could be used, for example, in a buffered crossbar, CIOQ, shared-memory or any other switch architecture.

According to one aspect of the invention, there is provided a method for counteracting a saturation tree occurring in a network having nodes, wherein data packet streams are transmitted over the nodes. Each node has at least one input and one output, wherein the data packets are received at an input of the node, and emitted over a predetermined output of the node, depending on the destination of the data packet. The method comprises the steps of: generating, at a local node where a congestion emerges, a first congestion information; sending the first congestion information to at least one upstream node; in response to one received first congestion information, comparing the content of the received first congestion information with a present local status based on a set of rules in order to identify at least one packet stream causing the congestion (culprits), and generating a second congestion information comprising the identified at least one packet stream causing the congestion, i.e. the second congestion information indicates the identified packet streams (culprits); and sending the second congestion information to at least one further upstream node from where the identified at least one packet stream was received.

In accordance with a further aspect of the invention, there is provided a switching fabric for counteracting a saturation tree occurring in a network having nodes. The switching fabric comprises a first processing unit and a first memory adapted to generate, at a local node where a congestion emerges, a first congestion information; a first port controlled by the first processing unit for sending the first congestion information to at least one upstream node; a second processing unit and a second memory adapted to compare the content of the received first congestion information with a present local status based on a set of predefined rules in order to identify at least one packet stream causing the congestion, and to generate a second congestion information comprising the identified at least one packet stream causing the congestion; and a further port for sending the second congestion information to at least one further upstream node from where the identified at least one packet stream was received.

Advantageous results are attained by recalculating the hold time in nodes upstream from the congestion root before sending a congestion message from such nodes. For the recalculation of the hold time, the local circumstances of round-trip time and buffer occupancies are considered.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention are described in detail below with reference to the drawings. To show more clearly the general inventive concept, an implementation in a typical switching scenario is assumed.

FIG. 1 schematically illustrates a data network with several stages of nodes.

FIG. 2 depicts one node with two input ports and two output ports.

FIG. 3 shows a more detailed view of a network with nodes.

FIG. 4 depicts a further network with four nodes.

DETAILED DESCRIPTION OF THE INVENTION

The present invention discloses a selective congestion control mechanism that provides dynamic reactive congestion control that could be used, for example, in a buffered crossbar, CIOQ, shared-memory or any other switch architecture. According to the invention, there is provided a method for counteracting a saturation tree occurring in a network having nodes, wherein data packet streams are transmitted over the nodes. Each node has at least one input and one output, wherein the data packets are received at an input of the node, and emitted over a predetermined output of the node, depending on the destination of the data packet. The method comprises the steps of: generating, at a local node where a congestion emerges, a first congestion information; sending the first congestion information to at least one upstream node; in response to one received first congestion information, comparing the content of the received first congestion information with a present local status based on a set of rules in order to identify at least one packet stream causing the congestion (culprits), and generating a second congestion information comprising the identified at least one packet stream causing the congestion, i.e. the second congestion information indicates the identified packet streams (culprits); and sending the second congestion information to at least one further upstream node from where the identified at least one packet stream was received.

The step of sending the second congestion information can comprise forwarding the second congestion information to a source node from which the identified at least one packet stream originates. This allows the source node to be informed directly, e.g. to stop the packet stream, and this can be achieved faster than the saturation tree develops.

In an advantageous embodiment, the first congestion information comprises an identifier identifying a congested root, e.g. a port or channel. The first congestion information is sent upstream. When an upstream node receives the first congestion information, it stores the identifier, in dependence on the set of rules, in one of a first list, herein also referred to as blacklist (BL), indicating the data stream causing the congestion, and a second list, herein also referred to as graylist (GL), indicating data streams suspected of causing congestion.

The second congestion information can comprise the identifier orcomprise groups of identifiers of data flows, i.e., labelling multipledata flows or streams. The second congestion information is sent furtherupstream, where a further upstream node receives the second congestioninformation and stores the identifier in dependence on the set of rulesin the first list (BL) or the second list (GL). It is advantageous thatthe identifier can identify multiple data flows as a so-calledcongestion culprit set (CC-set), because this reduces the congestioninformation sent around the network.

The step of generating a first congestion information at a local node can further comprise detecting the emerging congestion by applying the set of predefined rules. In an advantageous embodiment, each node comprises the set of predefined rules for detecting an emerging congestion. Which congestion information is generated and which list is used depends on the conditions of the set of predefined rules.

Further, based on the set of predefined rules, each identifier of incoming packet streams can be compared with the identifiers stored in the second list (GL). When a condition of the set of predefined rules with respect to one identifier holds, said identifier is transferred from the second list (GL) to the first list (BL). The respective node is then aware of packet streams causing congestion and can inform other nodes about those streams.

The second congestion information with the identifier can be sent upstream if the identifier is stored more than once in the first list (BL). This has the advantage that the upstream nodes are informed and can react accordingly.

A receiving source node, i.e. the node where the packet stream causing the congestion stems from, reduces the sending rate of the identified packet stream and sends a test packet to the local node where the congestion emerged. Thereby, entries of the received identifier in the second lists (GL) can be removed along the way of the test packet to the local node where the congestion emerged. This helps to clear up the entries in the second lists (GL). Entries in the first list (BL) are self-cleaning as they comprise an expiry time t after which the entry is removed.
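For illustration only, the following minimal Python sketch (all names hypothetical, not from the patent text) shows the list maintenance just described: a test packet clears the matching graylist entry along its path, while blacklist entries expire on their own after their time t.

```python
# Sketch of list maintenance in one node; Node, graylist and blacklist
# layouts are illustrative assumptions.

class Node:
    def __init__(self):
        self.graylist = {}   # flow_id -> suspect entry (no expiry time)
        self.blacklist = {}  # flow_id -> expiry time t (self-cleaning)

    def on_test_packet(self, flow_id):
        # A test packet from the source clears the suspect entry along
        # its path toward the node where the congestion emerged.
        self.graylist.pop(flow_id, None)

    def expire_blacklist(self, now):
        # BL entries are self-cleaning: drop entries whose time t expired.
        self.blacklist = {f: t for f, t in self.blacklist.items() if t > now}
```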

In the invention, there is also provided a switching fabric for counteracting a saturation tree occurring in a network having nodes. The switching fabric comprises a first processing unit and a first memory adapted to generate, at a local node where a congestion emerges, a first congestion information; a first port controlled by the first processing unit for sending the first congestion information to at least one upstream node; a second processing unit and a second memory adapted to compare the content of the received first congestion information with a present local status based on a set of predefined rules in order to identify at least one packet stream causing the congestion, and to generate a second congestion information comprising the identified at least one packet stream causing the congestion; and a further port for sending the second congestion information to at least one further upstream node from where the identified at least one packet stream was received. The switching fabric can further comprise a first list (BL) for storing identifiers indicating data streams causing the congestion and a second list (GL) for storing the identifiers indicating data streams suspected of causing congestion.

Each congestion information, also referred to as a congestion message, can be sent to upstream nodes of a previous stage from which data packets were received by the local or congested node. The upstream nodes that receive the identifier store it, and after receiving the congestion message, the upstream node can hold back the data packets or streams with the stored identifiers and let the data packets with different identifiers pass to the local or congested node. The advantage of this is that only the data packets that are congestion suspects are held back. Therefore, not a whole node is blocked by the congestion message, but only the data packets that cause the congestion are blocked, and other data packets can still be transmitted to the node at which a congestion arose.

In an advantageous embodiment, the first congestion information or message is only sent to nodes from which congesting data packets were received. Therefore, less information has to be transmitted in the network, reducing the traffic. The nodes of a first stage check whether a congestion is emerging and send the identifier of the congesting data packets within the second congestion information or message to nodes multiple stages upstream from which they directly or indirectly receive data packets. Therefore, the data packets that will cause a congestion are withheld more than one stage away from the congested nodes.

The second congestion information or message can comprise an expedite information (scope_K). This allows the second congestion information to be transported faster than the saturation tree grows.

In a further embodiment, the congestion information can comprise a hold time during which the upstream nodes will hold back the identified data packets. Therefore, it is not necessary to send a second message to the upstream nodes to resume transmission of the identified data packets if, for example, the congestion is resolved. However, in a further embodiment, the retaining of data packets during the hold-back time is cancelled by receiving an attracting information, i.e. a special type of flow control (FC) event, such as a qualified credit.

The upstream nodes that receive one congestion message store the hold time and the identifier of the data packets suspected of congestion. Within the hold time, the upstream nodes retain the data packets with the stored identifiers. Therefore, only the data packets that will cause the congestion are held back. Data packets with other identifiers are passed to the node from which the congestion message was received. Therefore, the data traffic is constrained only as much as necessary.

In a further advantageous embodiment, a data packet flow, also referred to as a stream, to which the data packets suspected of congestion belong is determined as the identifier for the congesting data packets. The data flow or packet stream can be detected by analyzing the header of a data packet. An example of marking suspect data packets is checking the free capacity of a memory portion for a given output port. Another example is to mark data packets with the highest memory occupancy of a given output port with an emerging congestion. In both examples, the identifiers of the marked data flows or packet streams are sent to an upstream node.

In another advantageous embodiment, an emerging congestion is checked and controlled individually for each output port of a first or local node.

In an advanced embodiment, the congestion message is sent several nodes upstream without checking in each upstream node whether there is a local emerging congestion. This assists in sending the congestion message upstream sooner than the spreading of the saturation trees. In this embodiment, the congestion message has no specified duration during which data packets should be withheld. This message is then interpreted as a notification of a suspect packet or data flow.

Further advantageous results are attained by recalculating the hold time in nodes upstream from the congestion root before sending a congestion message from such nodes. For the recalculation of the hold time, the local circumstances of round-trip time and buffer occupancies are considered.

In switch-based networks, it may occur that during a period of time a switch receives more data packets than it can handle. If the aggregate incoming rate of data packets is larger than the outgoing data rate, a bottleneck exists at this switch. In order to prevent a congestion with a large saturation tree that greatly deteriorates general network performance, a method for counteracting saturation trees is provided in such networks.

Before other embodiments are described, some general issues with respect to the present invention are addressed with reference to FIG. 4. FIG. 4 depicts an embodiment with a local or fourth node 30, a fifth node 35, a sixth node 36 and a seventh node 37. Each of the fourth, fifth, sixth and seventh nodes 30, 35, 36, 37 comprises two input ports 33, 34; 44; 74, 75 and two output ports 31, 32; 43; 45; 47, 49, and further a fourth, fifth, sixth, seventh processing unit 65, 67, 66, 68 and a fourth, fifth, sixth, seventh memory 69, 71, 70, 72 for storing data packets, respectively. Each processing unit 65, 67, 66, 68 has access to an additional memory 41, 40, 38, 39 wherein a first list (BL), herein also referred to as blacklist, and a second list (GL), herein also referred to as graylist, is provided. The black- and graylists can also be stored or provided in the buffer memories 69, 71, 70, 72. The input and output ports of the nodes are connected with bidirectional links for sending and receiving data packets. The fourth node 30 with the seventh output port 32 is contemplated as a hotspot, labelled with HS, where a congestion occurred. The fourth node 30 is connected upstream via the seventh input port 33 with the output port 45 of the sixth node 36. The fourth node 30 with the eighth input port 34 is connected with the output port 43 of the fifth node 35. The sixth node 36 is connected by the input port 74 to the output port 49 of the seventh node 37. Each of the nodes 30, 35, 36, 37 is embodied as a double input and output switching system. Depending on the destination address of the data packets, the nodes 30, 35, 36, 37 transmit the data packets that were received by the input ports to one of the output ports.

Elements of Bipolar Flow Control (BFC)

When referring to a congestion or information message, it can be one of three types (1), (2), (3), which are flow control (FC) messages sendable in-band within the network. A first congestion information or message (1), labeled in FIG. 4 with 50, and also denoted as Hold_all message or signal, is used and defined in more detail below. Further, when referring to a second congestion information or message (2), (3), labeled in FIG. 4 with 51, one of two types of a Hold_this message or signal is used, which are defined in more detail below.

In general, there are three types of flow control messages: Hold_all(node|port_ID, t, info), Hold_this(set|flow_ID, t), and Hold_this(set|flow_ID, info).

(1) Congested signal: The Hold_all(node|port_ID, scheduling horizon, info) message applies to all data packets going to an indicated hotspotted downstream output port. The Hold_all message is sent upstream from the node with the hotspotted output, i.e. from the root of a saturation tree (e.g. the seventh output port 32 of the fourth node 30). The Hold_all travels one stage upstream (to nodes 35, 36). It does not necessarily comprise any specific flow identifiers, but an output port number (of the seventh output port 32). For efficiency, the flow identifiers of all the packet flows or streams destined for that output port 32 are not put on a blacklist. During a horizon duration, the withholding is performed at transmission schedulers (TXS) (in nodes 35 and 36) which control which data packets are allowed to travel on the links to the downstream node. The transmission schedulers (TXS) are not shown for simplicity. The info field is described below.

(2) Congesting_culprit signal: The Hold_this(flow|set_identification, scheduling horizon, info) message applies to so-called culprit packets (deemed as causing congestion) going to at least one end destination address. These packets are identified by the flow identification (found in data packet headers). The end destination address of the flow identification is placed as an identifier in an entry on the blacklist. As explained below for the congestion culprit (CC) set, to reduce flow control signalling overhead, any flow identification can represent a multitude of flows sharing at least one congested segment. The Hold_this messages are typically sent by nodes in stages upstream from the root of the flow control (FC) (nodes left of node 30). Any of these nodes may (re)calculate its corresponding scheduling horizon for blacklisted packets, i.e. saturation tree culprits. The info field is described below.

(3) Congesting_suspect signal: The Hold_this(flow|set_identification, info) message applies to packets suspected of causing the hotspot and going to at least one end destination address. Such packets are to be placed on a graylist, from where they can be upgraded to the blacklist, contingent upon the set of rules. The set of rules is usually predefined.

info = {HS_severity, scope_K, update}, where:

- HS_severity = ratio of aggregate arrivals vs. the hotspot service rate (8-64 bit). Can also be approximated in a 2-bit field: 0 = urgent; 1 = very severe; 2 = severe; 3 = medium.
- scope_K = number of stages to propagate this message; 0 => immediate propagation to the sources, bypassing any checks and recalculations; default = 1 (single hop, with per-node revalidation); 2, 3 . . . hops direct.
- update = how fresh the HS_severity and scheduling horizon fields are. If update = 0, this is the original message content as it was initially issued; otherwise the fields were updated at every stage.
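As a non-authoritative sketch, the three flow control messages and the info field could be represented as follows; the patent specifies only the field semantics, so the Python dataclass layout and names are assumptions.

```python
# Illustrative encoding of the three FC message types; field names follow
# the text, the dataclass layout itself is an assumption.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Info:
    hs_severity: int   # e.g. 2-bit approximation: 0=urgent .. 3=medium
    scope_k: int       # stages to propagate; 0 = expedited to the sources
    update: int        # 0 = original content, else refreshed at every stage

@dataclass
class HoldAll:                      # (1) Congested signal, sent by the root
    node_port_id: Tuple[int, int]   # (node, output port), e.g. (30, 32)
    horizon: int                    # scheduling horizon t in packet cycles
    info: Info

@dataclass
class HoldThis:                     # (2)/(3) culprit or suspect signal
    flow_or_set_id: int             # flow ID or CC-set representative
    horizon: Optional[int]          # None for Congesting_suspect (no horizon)
    info: Info
```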

Typically, any root node can issue the first congestion information or message (1). From the next stage upstream, the nodes issue the second congestion information or message (2), (3), which will be propagated upstream according to the scope_K field: expedited (direct to sources), multihop skip, or hop-by-hop.

Calculation of Scheduling Horizons “t” for Different Flow Control (FC) Messages

Hold or deferment times are also called “scheduling horizons” for the upstream transmission scheduler(s) (TXS).

(A) Calculation of the scheduling horizon for the Hold_all(output port number, scheduling horizon) message takes place in the node that is the root of the tree. The scheduling horizon T = max(NCP − RTT, 0), where NCP is the number of cold packets, i.e. packets not considered as causing congestion, and RTT is the round trip time. The message is preferably only sent when T is not zero.

(B) The (re)calculation of the scheduling horizon of the Hold_this(flow identification, scheduling horizon) message is based on the local NCP and on the local RTT with the next upstream node (e.g. the sixth node 36 would take its local NCP and the RTT of the link connecting the nodes 36, 37). A further scheduling horizon T′ = max(NCP − RTT, 0), where NCP is calculated differently from (A). Here, NCP is the number of non-hotspotting packets expected to keep flowing (out of the sixth node 36) while the Hold_this message travels upstream (to the seventh node 37), while the scheduling horizon is active (in node 37), and while new data packets are travelling downstream after the scheduling horizon expiry (from the seventh node 37 via the link to the sixth node 36). Since only data packets of specific blacklisted flows are held, and not all data packets destined for a particular downstream output port, the NCP in a non-root node is defined as “all currently locally available data packets not belonging to blacklisted flows”. NCP for an output (e.g. the output of the sixth node 36) is calculated as follows: NCP = (the sum of all currently present data packets for this output) minus (the sum of currently present data packets belonging to blacklisted data flows for this output).
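A minimal sketch of the two horizon calculations (A) and (B), assuming NCP and RTT are already expressed in packet cycles; the function names are hypothetical.

```python
# (A) Scheduling horizon at the root: T = max(NCP - RTT, 0), where NCP is
# the number of cold packets queued for the hotspotted output port.
def root_horizon(ncp: int, rtt: int) -> int:
    return max(ncp - rtt, 0)  # the message is only sent if this is nonzero

# (B) Non-root recalculation: NCP = all packets currently queued for this
# output minus those belonging to blacklisted flows (static snapshot).
def nonroot_horizon(packets_for_output: int, blacklisted_packets: int,
                    rtt_to_upstream: int) -> int:
    ncp = packets_for_output - blacklisted_packets
    return max(ncp - rtt_to_upstream, 0)
```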

Black- and Graylists

As indicated above, each node has in the additional memory 41, 40, 38, 39 a blacklist, short BL, and an optional graylist, short GL, which can be implemented as a table. The BL table stores the set|flow IDs and scheduling horizons of already known hotspot culprits, i.e. packets deemed as causing congestion, while the GL does the same for suspects, i.e. packets that are suspected of causing congestion. However, as graylisted flows have no scheduling horizons associated with them, they will not expire after t. Instead, a GL entry can be either upgraded to the BL (new culprit, with default t = RTT), or cleared later on by a special test packet from the source or source node. A so-called garbage collection method can clean the stale GL entries. Typically, one blacklist (BL) is provided and therefore stored per output port.

Blacklist

As mentioned, each node comprises one blacklist. All data packets belonging to a blacklisted flow are held at an output until its current scheduling horizon expires. A basic blacklist entry has the following format: Blacklist Entry = (flow identification, scheduling horizon, occupancy count). In order to (re)calculate the scheduling horizon for Hold_this(flow identification, scheduling horizon) messages, each blacklist entry comprises an occupancy count next to the flow identification and the scheduling horizon. Each time a data packet enters the node at any input port and is destined for an output port that has a blacklist entry for the flow this packet belongs to, the occupancy count is increased by one. The static sum of all blacklist occupancy counts for an output port is the second term in the above NCP calculation for Hold_this messages; a static sum, because for efficiency only a first-order approximation is calculated, namely the scheduling horizons that qualify each blacklist entry are disregarded; the counters' sum is therefore a static snapshot. When the scheduling horizon expires, the entry and thus the count is cleared. When a new entry is added to the blacklist, the occupancy count is initialized to zero.
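The occupancy counting can be illustrated with the following sketch; the class layout and method names are assumptions, but the zero initialization, the per-arrival increment, the static sum and the clearing on expiry follow the text above.

```python
# Per-output blacklist: entry = flow_id -> (scheduling horizon, count).
class Blacklist:
    def __init__(self):
        self.entries = {}  # flow_id -> {"horizon": int, "count": int}

    def add(self, flow_id, horizon):
        # New entries start with an occupancy count of zero.
        self.entries[flow_id] = {"horizon": horizon, "count": 0}

    def on_packet_arrival(self, flow_id):
        # Each arriving packet destined for this output that matches a
        # blacklisted flow increments that entry's occupancy count.
        if flow_id in self.entries:
            self.entries[flow_id]["count"] += 1

    def static_occupancy_sum(self):
        # First-order approximation: sum the counts as a static snapshot,
        # disregarding which horizons are still active.
        return sum(e["count"] for e in self.entries.values())

    def expire(self, now):
        # When a horizon expires, the entry and thus its count is cleared.
        self.entries = {f: e for f, e in self.entries.items()
                        if e["horizon"] > now}
```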

Graylist

The graylist is built with suspects identified by “Congesting_suspect” Hold_this(flow|set_identification, info) messages. The difference from the blacklist is the absence of a scheduling horizon, and hence this message is forwarded upstream without a (re)calculation of the scheduling horizon. After a flow's local blacklist re-entry, or under another rule of the set, this message can be sent as an early warning for a programmable number of stages upstream. Congesting suspects are kept on the graylist, thus also helping future packet marking upstream. An expensive histogramming method, for example, is not needed. The graylist also improves marking accuracy in multiple-hotspot scenarios.

The graylist can be used for marking “congestion culprits”. If a flow or stream is graylisted in a node and later on a sizable, determined fraction of packets matching this flow is causing locally incipient congestion, i.e. this node becomes a congestion root, then these packet flows are blacklisted.

Elaborations on the BFC

Herein, among the various packet data streams or flows, a distinction is made between so-called hot flows or streams and cold flows or streams, as explained in more detail below. Hot flows group into a congestion culprit set (CC-set). The congestion culprit set (CC-set) is a group of hot flows that share at least one common hot path segment, i.e. a bottleneck link, between their various sources and (potentially all different) destinations. Within the switch fabric built of high-degree switching nodes (8, 32, 256 ports), the flows of a CC-set may converge and share one or a few segments, then diverge and possibly merge and diverge again; adaptive routing and dynamic load balancing yield such routing graphs.

Grouping into CC-sets exploits the features of either a static (source-based) or a dynamic look-ahead routing scheme. If to each blacklist entry a count L, that is the number of hops from the hotspot root, is added, L can be compared against the downstream routing path of each data packet. If an incoming data packet being compared is not matched as a known hot culprit, because it does not have its own blacklist entry, this is still not sufficient to declare it “cold” and forward it downstream. The data packet may share two or more segments (up to L) with any of the previously blacklisted flows, which serve as CC-set representatives; in this case, the current packet is declared “hot” and subjected to the same restrictions as its CC-set representative, even though the current packet (its flow) was not nominally blacklisted. The advantage is that there is no need to increase the FC signaling overhead and the blacklist table sizes. The CC-set grouping is considered a desirable optimization, beneficial in a practical implementation of BFC. It may also benefit future adaptive routing and load balancing methods.
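A hedged sketch of this CC-set membership test; the path encoding and the exact sharing threshold (“two or more segments” is taken literally here) are assumptions, not a definitive implementation.

```python
# CC-set check: a packet without its own blacklist entry is still declared
# "hot" if its downstream routing path shares enough segments, within the
# first L hops from the hotspot root, with a blacklisted representative.
def is_hot_via_cc_set(packet_path, representative_path, hops_l, min_shared=2):
    # Paths are assumed to be sequences of segment identifiers; min_shared=2
    # follows the "two or more segments" wording in the text.
    shared = set(packet_path[:hops_l]) & set(representative_path[:hops_l])
    return len(shared) >= min_shared
```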

Expedited Upstream Signaling (Static/Dynamic Saturation Tree Control)

As a dynamic flow and congestion control (FCC) mechanism, BFC is designed to work in conjunction with, or without, a source-based static congestion management (CM) mechanism that provides adequate source behavior, e.g. fair rate adjustments, during steady-state saturation trees. If both BFC and CM are present, then BFC supports CM as follows:

If one congestion message, i.e. a Hold( ) message, is re-entered two or more times in the same blacklist, or if the scope_K of the Hold( ) is cleared to 0, this event generates a so-called backward explicit FC notification (BECN), with expedited propagation all the way towards the sources. The number of blacklist re-entries to trigger this expedited notification is programmable.

An expedited propagation has two advantages. First, congestion control and management signaling is accelerated; for sustained hotspots, when CM is appropriate, CM BECN signaling is faster. Second, the packet marking accuracy improves. A node generates expedited BECNs with a higher confidence, and only sends them to the affected source(s) when appropriate. In this way, there are fewer false CM notifications. A reason for inaccuracy is otherwise that marking is solely based on a buffer's occupancy in a single node, neither further validated locally by re-sampling (repetitions) nor in other nodes that could also be hotspotted by the same flows.

Finally, an expedited BECN is usable both for the graylist and as an immediate Hold_this( ), whose corresponding entry will eventually be cleared either by one CM special message, e.g. a test packet (TP), or by an attraction event, e.g. a special type of link-level flow-control credit.

Vice versa, CM also supports BFC. If every table entry also stores the source identification along with other data, e.g. destination identification, scheduling horizon, occupancy count, then the congestion-managed traffic sources can remove their own GL entries after the respective hot flows were notified by BECNs. For example, one special test packet (TP) per BECN-notified flow will clear its entries from all the GL tables along the path from the sources to the hotspot. This helps table management by reducing the number of active entries and their search/match time.

Turning now to the description of the figures, FIG. 1 illustrates a network of nodes 1A, 1B, 1C, 1D, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D with three stages. The nodes are implemented as switches in this embodiment of the network. The nodes are part of an arbitrarily arranged network with many nodes that is used for transmitting data packets from a source to a destination. Depending on the network, the nodes could also be implemented as routers, computers or other machines that receive data packets and transmit the data packets to other nodes and thus act as switches.

In the depicted embodiment, each of the nodes comprises two inputs I and two outputs O that are connected with inputs or outputs of other nodes by data lines, as indicated by arrows. Data flows are flowing downstream and control information is flowing in the opposite direction, i.e. upstream. In the known congestion controlling methods, a tree of upstream nodes is blocked if a congestion globalizes. They make no differentiation between data packets that cause the congestion and data packets that are only victims of the congestion if “culprits” and “victims” share the same buffer. If in FIG. 1 the output port connected with destination B is congested, all upstream nodes that have a direct or indirect connection with that output port are blocked, as indicated by the thicker arrows forming a saturation tree 17. Even some flows that do not contribute to the congestion, e.g. the data path between a source node C or node A and the destination node D, are blocked.

FIG. 2 depicts in more detail a local or first node 1 comprising a first and a second input port 4, 9 and a first and a second output port 5, 6. The output ports 5, 6 of the first node 1 are connected to a first and a second link 7, 10. The first input port 4 of the first node 1 is connected to a third link 14. The second input port 9 is connected with a fourth link 15. A link constitutes a data line that is used for sending data packets downstream and for sending control information and messages upstream. The input and output ports 4, 9, 5, 6 are connected with a first buffer 16, also referred to as local memory 16. The first buffer 16 is controlled by a local or first processing unit 19.

The data packets are received by the first or the second input port 4, 9, and stored in the buffer 16. The first processing unit 19 checks the header of the data packet for determining the output port 5, 6 to which the data packet is to be delivered.

The data packets comprise header and payload information. The header comprises a source address from which the data packet comes and a destination address to which the data packet is to be sent. Depending on the destination address, the first processing unit 19 chooses the suitable output port 5, 6 to transmit the data packet to the respective destination address. The first processing unit 19 comprises an additional memory 60 in which a black- and graylist and a routing table are stored to determine whether the data packet should be transmitted on the first or the second output port 5, 6, depending on the destination address of the data packet.

The first and the second output port 5, 6 can also be used for receiving congestion messages that were sent from a downstream node. A received congestion message is stored in the first buffer 16, or in the additional memory 60 of the first processing unit 19. The congestion messages are checked by the first processing unit 19, and depending on the congestion message, only data packets not suspected of generating congestion in a downstream node will be transmitted through the first or the second output port 5, 6.

Furthermore, the first processing unit 19 monitors the data packets destined for the output ports 5, 6 to detect an emerging congestion on the output ports 5, 6 before a congestion globalizes. An emerging congestion on an output port is detected by the first processing unit 19 if the number of data packets waiting for that port exceeds a given threshold. Of course, other methods could be used for detection. If the first processing unit 19 detects an emerging congestion, it generates a first congestion message and sends the first congestion message upstream via the first or the second input port 4, 9, or both.
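As a minimal sketch of this detection step (the queue representation and the threshold value are illustrative assumptions; the patent leaves the threshold open):

```python
# Per-output congestion check in the processing unit: an output port is
# flagged when the number of packets queued for it exceeds a threshold.
THRESHOLD = 64  # illustrative value only

def congested_outputs(queue_depths):
    """queue_depths: mapping output_port -> number of waiting packets."""
    return [port for port, depth in queue_depths.items() if depth > THRESHOLD]
```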

FIG. 3 shows a part of a network with the first, a second and a third node 1, 2, 3, the first, the second and the third node 1, 2, 3 being identically structured in this embodiment. The third link 14 is connected to a third output port 8 of the second node 2. The second node 2 comprises a fourth output port 11 and a third and a fourth input port 12, 13. The third input port is connected by a fifth link 28 to a fifth output port 61 of the third node 3. The third node 3 furthermore comprises a sixth output port 62 and a fifth and a sixth input port 63, 64. The fifth and the sixth input port 63, 64 are connected to further links, respectively, e.g. as indicated, the fifth input port 63 to source address A. The ports 9, 11, 13, 62 are also connected to further links.

The second node 2 comprises a second buffer 20 that is connected to the input and output ports 12, 13, 8, 11. The second buffer 20 is controlled by the second processing unit 21, also referred to as further processing unit 21. The third node 3 comprises a third buffer 23 that is connected with the fifth and sixth output ports and with the fifth and sixth input ports 61, 62, 63, 64. The third buffer 23 is controlled by a third processing unit 25.

In the shown embodiment, the data packets are transmitted from the left side to the right side as indicated by the arrow that is arranged above the nodes 1, 2, 3. A data packet comprises a header and preferably payload information. In the header, an identifier or identifier information for the data flow or stream to which the packet belongs is stored. The identifier comprises a source address and the destination address. The data packets of one data flow comprise the same source and the same destination address. If one data flow is transmitted from source address A to a first destination address B, then the data packets are sent to the fifth input port 63 of the third node 3. From the fifth input port 63, the data packets are stored in the third buffer 23. The third processing unit 25 checks the header of the data packets for retrieving the destination address. The third processing unit 25 detects the first destination address B as a destination address and sends a control signal to the third buffer 23 to deliver the data packets to the fifth output port 61. From the fifth output port 61, the data packets are sent to the third input port 12 over the fifth link 28.

From the third input port 12, the data packets are stored in the second buffer 20. The second processing unit 21 checks the header of the stored data packets and detects the destination address B. Therefore, the second processing unit 21 controls the second buffer 20 to deliver the data packets to the third output port 8. From the third output port 8, the data packets are transferred over the third link 14 to the first input port 4 of the first node 1.

From the first input port 4, the data packets are stored in the first buffer 16. The headers of the stored data packets are checked by the first processing unit 19 for detecting the destination address. In this embodiment, the headers of the data packets comprise the first destination address B that can be reached by using the first output port 5. Therefore, the first processing unit 19 controls the first buffer 16 to deliver the data packets to the first output port 5.

The data packets that are stored in the first buffer 16 and dedicated to the first destination address B are put out over the first output port 5 to the first link 7 that is connected with the first destination address B. If the transmitting capacity of the first link 7 is less than the rate of data packets that are delivered by the first and second input ports 4, 9 to the first node 1 to be transmitted over the first link 7, a congestion emerges in the first buffer 16 related to the first output port 5.

The occupancy of the first buffer 16 related to the first or second output port 5, 6 is monitored by the first processing unit 19. An emerging congestion is detected by the first processing unit 19 before the congestion globalizes. If the first processing unit 19 detects an emerging congestion in the first buffer 16 for the first and/or second output port 5, 6, the first processing unit 19 checks the stored data packets that are suspected of congestion for packet marking. The identifiers of one or more data flows to which the suspect data packets belong are used in the congestion messages.

Furthermore, in an advantageous embodiment, the first processing unit 19 calculates a first hold time T1 during which no further data packets destined for the congested first output port 5 should be transmitted to the first node 1, i.e. the first buffer 16, to counteract the spreading of a saturation tree. The first processing unit 19 generates a first congestion message comprising the first hold time T1 and an output port identifier for the output port at which a congestion emerges, assuming the first node 1 is the root/origin of the congestion.

The first hold time should be calculated to counteract a saturation tree that originates from a congestion of data packets and to ensure that enough data packets are stored in the first buffer to utilize the available transmitting capacity of the first or second output port 5, 6 and the first or second link 7, 10. The first hold time T1 is calculated relative to the transmitting capacity of the first link 7 and the amount of data packets that are stored in the first buffer 16 dedicated to the first output port 5. Furthermore, it is advantageous to consider the time the congestion message takes to go up to the next node and the time a data packet takes to flow down from the upstream node to the node from which the congestion message was sent. This time is named the round trip time (RTT). Considering the RTT in calculating the hold time results in a more precise controlling method for counteracting saturation trees while concurrently avoiding depleting the first buffer 16 of data packets, so that the available transmitting capacity of the first output port 5 can be used efficiently. For using the round trip time, it is helpful to measure the RTT or to have a look-up table in which the RTT of the next upstream node is stored.

In an advantageous embodiment, the first processing unit 19 calculates the duration of the hold time in the following manner: NCP = sum of occupancies of the first buffer 16 for an output, i.e. the number of data packets that are currently not on hold, where the hold time is calculated in packet cycle times: Hold time = max(NCP − RTT, 0), where RTT is the round trip time expressed in number of packet cycles. If the hold time is zero, preferably no congestion message is sent. The abbreviation NCP stands for the number of cold packets that can flow through the first output port 5 without congestion. The number of cold packets is a first-order approximation of the number of freely moving packets belonging to cold flows, i.e. the flows that are not congestion culprits. The cold flows are expected to flow uninhibited; however, unlike the hot flows that cause congestions, they may cause buffer underflow.
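As a worked example with hypothetical numbers: if 20 cold packets are queued for the first output port 5 (NCP = 20) and the round trip time to the next upstream node is 8 packet cycles (RTT = 8), the hold time is max(20 − 8, 0) = 12 packet cycles; with NCP = 5 and RTT = 8, the result is max(5 − 8, 0) = 0 and preferably no congestion message is sent.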

The hold time is calculated such that new data packets arrive just in time to prevent an underflow of the buffer for a given output port. The packets of the hot (congestion-causing) flows that are on hold locally in a buffer do not move, and therefore their number does not appear in the calculation. The subtraction of RTT from the NCP is a compensation for two delays: first, it compensates for the time it takes the congestion message to travel upstream to the next node, and second, it compensates for the time it takes a data packet to travel from the upstream node, once the hold time has expired, down to the node which sent the congestion message. Congestion messages with hold times are only sent if the hold duration is not zero. The hold time is in this embodiment expressed in time units of a packet cycle within the node. The hold time is used in the transmission scheduling of an upstream node. As long as the hold time has not expired, the transmission scheduling will not schedule any data packets that the hold function of the congestion message applies to. The transmission scheduling horizon is the time when the hold time expires. The hold time is part of the congestion message and tells the upstream nodes how long data packets of culprit data flows have to be held back.

Depending on the embodiment of the transmission schedulers (TXS), data packets belonging to different data flows may be transmitted by the same transmission scheduler, where a congestion may arise that is caused by one determined data flow. Other data flows can still be transmitted over the first link 7 without any congestion. Therefore, it is useful to detect which data flows are causing a congestion.

After detecting an emerging congestion in the first buffer 16 and generating the first congestion message, the first processing unit 19 sends the first congestion message over the third link 14 upstream to the second node 2, which is arranged directly upstream of the first node 1, and possibly also on other ports/links, such as the second input port 9. The first congestion message is received by the third output port 8 and stored in the second buffer 20, or in the control unit 21, for example. The first congestion message is analyzed by the second processing unit 21. The second processing unit 21 detects the information about the congested output port in the first node 1. The first congestion message can be constituted as an output port message at the root node of the congestion, comprising a port identifier for the congested output port and a hold time. In the node upstream from the root node of the congestion, the first congestion message is transferred into a second congestion message, also referred to as a data flow message. Generally, the output port message, i.e. the first congestion message, applies to all data packets travelling to the indicated congested downstream output port and is sent upstream from the node with a hotspotted output, e.g. from the root node of the congestion.

Normally, the first congestion message or output port message travels only one stage upstream. It may comprise only an output port number. For efficiency, the flow identifiers of all the flows destined for that output port are not put on the blacklist. During the hold duration, the withholding is performed at the transmission schedulers that control which data packets are allowed to travel on the links to the downstream node.

Then, the second processing unit 21 checks the data packets that are stored in the second buffer 20. If the second processing unit 21 detects data packets with an identifier on the blacklist, these data packets are held back until the first hold time T1 expires naturally or is cancelled, for example, by an attraction message from the downstream first node 1. The attraction is a message from a downstream stage indicating availability of specific resources. Such an attraction message may comprise an identifier for the data packets that are allowed to be sent by the upstream node (e.g. a credit).

The data packets in the second node 2 that are not destined for a congested output port of the first node 1 are transmitted to the first node 1. Therefore, only the data packets that probably cause the congestion are withheld, and the other data packets can freely flow via the first node 1.

The second processing unit 21 monitors the second buffer 20 for detecting an emerging congestion. If the second processing unit 21 detects an emerging congestion at an output port, it analyzes the data packets that cause this congestion and determines the identifier of the data flow to which the data packets belong. The second processing unit 21 generates a second congestion message as described above. The second congestion message comprises an identifier for the data flow of the data packets that are suspected of generating a congestion in the second node 2. The second congestion message can further comprise a second hold time T2 that is calculatable using the local context (NCP, RTT) of the second node 2. The second congestion message is transmitted over the fifth link 28 to the third node 3.

The data flow message, that is the second congestion message, applies to data packets going to one end destination address. These data packets are identified by the flow identification that is enclosed in a data packet header. The end destination address of the flow identification is put in an entry on the blacklist at the receiving node. The data flow message can only be sent by nodes that are arranged at least one stage upstream from the root node of the congestion. Any of these nodes can recalculate the hold time if it sends the data flow message upstream. The recalculation of the hold time of the data flow message is based on the local number of non-congesting packets and on the local round trip time (RTT) with the next upstream node. The non-congesting packets are those data packets that keep flowing while the data flow congestion message travels upstream, while the hold time is active, and while new data packets are travelling downstream after the hold time expires. As only data packets of specific blacklisted data flows are held, the number of non-congesting data packets (NCP) is defined by all data packets currently available in the local node that do not belong to blacklisted data flows.

If a data flow message is received by the third node 3, the data flow identifier of the data flow message is stored in the blacklist.

Preferably, each node has a blacklist. All data packets belonging to a blacklisted data flow are held at an output port until its current hold time expires. Depending on the identifiers used, the identified data flows and/or the data packets predetermined for the identified output port are held back during the respective hold times.

The third processing unit 25 of the third node 3 stores the identifier of the congesting data packets in the blacklist of the third buffer 23. As it is shown, the congestion messages are transmitted in the reverse direction with respect to the data packet flow.

In an embodiment, the second processing unit 21 transmits the second congestion message to the nodes that are arranged upstream and connected with the second node 2, although there is no congestion in the second node 2. This approach could be used for an output message and/or a data flow message.

In an advantageous embodiment, the second congestion message comprises information on the number of stages over which the congestion message should automatically be delivered to upstream nodes in the absence of an emerging congestion in the upstream node. Using this feature, the second congestion message is propagated upstream even sooner than in the case in which congestion messages are propagated only one stage upstream.

In a further advantageous embodiment, in a local or congested node, a local mask duration is calculated for preventing attracting messages for data packets belonging to data flows that are withheld by the congested node's upstream node. It is not efficient to attract data packets for a congested output that is or will be on hold. Such attracting messages should be masked. The duration of the masking is:

Mask duration = NCP − (RTT/2) in time units of packet cycles, whereby NCP is the sum of data packets that are currently not on hold in the congested node, and RTT describes the round trip time of the input link of the switch with the congested output, which means the time it would take for a control message to travel from a node to an upstream node plus the time it would take for a data packet to travel from the upstream node down to the node that sent the control message. The time when the local masking expires is called the local attraction horizon. When the hold time in an upstream node has expired, that node may start sending new data packets downstream. At that moment, new attractions should arrive from the downstream switch. The attractions are sent earlier by half of the round trip time to arrive on time. All values are expressed in packet cycles.
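The masking calculation can be sketched as follows; clamping a negative result at zero is an assumption on our part, since the patent gives only the formula above.

```python
# Attraction masking at the congested node: attractions (e.g. credits) for
# a held output are suppressed for NCP - RTT/2 packet cycles so that new
# attractions arrive just as the upstream hold time expires.
def attraction_mask_duration(ncp: int, rtt: int) -> int:
    return max(ncp - rtt // 2, 0)
```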

In the situation indicated in FIG. 4, there is an emerging congestion at the seventh output port 32 of the fourth node 30, which is the hotspot (HS). The fourth processing unit 65 of the fourth node 30 initiates the transmission of a first congestion message 50, comprising Hold_all information, as an output port congestion message via the seventh and the eighth input ports 33, 34 to the upstream fifth node 35 and the sixth node 36. Here the first congestion message 50 comprises a hold time T and the information that the seventh output port 32 of the fourth node 30 is suffering from congestion. The sixth node 36 withholds data packets that are destined for output port 32 of node 30, and, if the incoming culprit flow persists, a congestion occurs at the output port of node 36 that is connected to the fourth node 30. Then, the fifth processing unit 66 of the sixth node 36 itself generates a second congestion message 51, comprising Hold_this information, as a data flow congestion message with a further hold time T′ and the information which data flow is suspected of causing the congestion. The second congestion message 51 is sent from the sixth node 36 to the upstream nodes. The seventh node 37 stores the identifier of the congesting data packets in the blacklist of the additional memory 39.

In an advantageous embodiment, the second congestion message 51 comprises the information that the data flow congestion message should be delivered to a predetermined number of consecutive stages upstream. In this embodiment, the second congestion message should be sent two stages upstream. This means that the sixth node 36, which is the first stage upstream of the fourth node 30, delivers the data flow congestion message upstream to the seventh node 37, and from there one stage further.

The fifth processing unit 66 of the sixth node 36 calculates the hold time of each data flow congestion message it sends. This hold time depends on the local NCP and RTT for nodes that are not the root of the congestion, as explained above. As indicated in FIG. 4, the second congestion message 51 is individually transmitted by node 36 to the next upstream node 37 and, if a congestion occurs, the upstream node itself generates a further second or data flow congestion message that comprises an identifier for the data packets that cause the congestion at the node and may comprise a hold time.

The seventh node 37 and possible nodes further upstream of the seventh node 37 use the blacklist provided in the additional memory 39 to discriminate the data packets that cause the congestion, the so-called culprit packets, from the data packets that suffer from the congestion without being responsible for it, the so-called victim packets. If the data flow of an end-to-end transmission path is used as the identifier for determining the culprit data packets, all data packets belonging to this data flow are blocked in the node whenever there is a congestion downstream caused by that data flow.
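In code, this discrimination amounts to looking up the flow identifier of each incoming packet in the blacklist and checking whether the hold horizon has expired. The sketch below is illustrative; the entry layout (flow identifier mapped to hold horizon) is an assumption:

    # Illustrative culprit/victim discrimination against the blacklist;
    # the entry layout (flow id -> hold horizon) is an assumption.

    def is_culprit(flow_id: int, blacklist: dict, now: int) -> bool:
        # Culprit: blacklisted flow whose hold horizon has not expired.
        horizon = blacklist.get(flow_id)
        return horizon is not None and now < horizon

    blacklist = {7: 120}                          # flow 7 held until cycle 120
    assert is_culprit(7, blacklist, now=100)      # culprit: withheld
    assert not is_culprit(7, blacklist, now=130)  # horizon expired: forwarded
    assert not is_culprit(3, blacklist, now=100)  # victim flow: forwarded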

An overview of the operation at the various locations or nodes is given by the following transcription; a Python sketch of the upstream-node handler follows the transcription. Assume port 32 of local node 30 is backpressured by a slow consumer at the receiving end of the link (the bottleneck). If the arrivals on input ports 33, 34 destined for port 32 exceed the service rate of the bottleneck link, this hotspot becomes the root of a saturation tree. The objective is to counteract the saturation tree, first locally, then globally, without reducing the aggregate network performance.

@Root Node 30:

-   hotspot (HS) detection at the output port 32 based on buffer status, downstream status (rate, no. of credits) and fabric parameters (up/down link RTTs);
-   global marking condition > threshold => all flows going through output port 32 are considered culprits;
-   issue a Congested event => Hold_all(30|32, scheduling horizon, info) message 50 is sent upstream to upstream nodes 35, 36.

@ k=1 Upstream Nodes 35, 36:

-   receive Hold_all(30|32, scheduling horizon, info) from local node 30;
-   store in blacklist BL an entry comprising the Hold_all information;
-   if scope_K=0 => expedited forwarding upstream;
-   validation of the HS event, marking refinement:
    -   if ((t > T_lim OR HS_severity > HS_thshld) AND (buff_occupancy > M_thshld)) OR (new arrivals headed to 30|32), then for every culprit arrival, or counting all culprits in buff_occupancy, issue the congestion message 51 as
        -   Congesting_culprit event => Hold_this(flow|set_identification, scheduling horizon, info), or
        -   Congesting_suspect event => Hold_this(flow|set_identification, info), if under the limits required for Congesting_culprit.

@ k=2 Upstream Node 37:

-   receive the congestion message 51 from nodes 35, 36: Hold_all(30|32, scheduling horizon, info), or Hold_this(flow|set_identification, scheduling horizon, info), or Hold_this(flow|set_identification, info);
-   if Hold_this(flow|set_identification, scheduling horizon, info) => store a BL entry representative (any subsequent flow ID matching this entry will be stopped until the horizon t expires);
-   if Hold_this(flow|set_identification, info) => store a GL entry representative (to serve as prime candidate for future marking, should a new HS root appear);
-   if scope_K=0 => expedited forwarding upstream;
-   else (k>0), recalculate the horizon to be sent upstream and issue
    -   Congesting_culprit event => Hold_this(flow|set_identification, scheduling horizon, info), or
    -   Congesting_suspect event => Hold_this(flow|set_identification, info), if under the limits required for Congesting_culprit.

@Ingress Sources (if scope_K=0):

-   receive Hold_all(30|32, scheduling horizon, info), or Hold_this(flow|set_identification, scheduling horizon, info), or Hold_this(flow|set_identification, info);
-   perform rate or window adjustment (reduce injections);
-   after a calculated period, send or broadcast downstream a test packet (TP) to clear the GL entries in the nodes.
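The k ≥ 1 behavior above can be condensed into a single handler. The following Python sketch is an interpretation of the transcription; the threshold test, list layouts and all names are assumptions rather than the actual implementation:

    # Interpretive sketch of the upstream-node handler (k >= 1); the
    # threshold test, list layouts and all names are assumptions.

    def on_congestion_message(node, msg, now: int) -> None:
        if msg.kind == "Hold_all":
            # Root notification: blacklist the congested node|port.
            node.BL[(msg.node_id, msg.output_port)] = now + msg.hold_time
        elif msg.has_horizon:
            # Confirmed culprit: stop matching flows until the horizon.
            node.BL[msg.flow_id] = now + msg.hold_time
        else:
            # Suspect only: greylist as candidate for future marking.
            node.GL.add(msg.flow_id)

        if node.scope_k == 0:
            node.forward_upstream(msg)          # expedited forwarding
        elif node.hotspot_confirmed(now):       # validation, marking refinement
            horizon = node.recalculate_horizon(msg, now)
            node.issue_hold_this(msg, horizon)  # congestion message 51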

The advantages of the present invention, however, are not limited to the described embodiments, but can also be obtained in a data network that is constituted differently, particularly one with other kinds of switches. The present invention can be realized in hardware, software, or a combination of hardware and software. A system according to the present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system, or other apparatus adapted for carrying out the methods and/or functions described herein, is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or after reproduction in a different material form.

Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention.

It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention are suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed as merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or by modifying the invention in ways known to those familiar with the art.

1. A method for counteracting a saturation tree occurring in a network having nodes, comprising: generating at a local node where a congestion emerges a first congestion information; sending the first congestion information to at least one upstream node; responsive to one received first congestion information comparing the content of the received first congestion information with a present local status based on a set of rules in order to identify at least one packet stream causing the congestion, and generating a second congestion information comprising the identified at least one packet stream causing the congestion; and sending the second congestion information to at least one further upstream node.
2. A method according to claim 1, wherein the step of sending the second congestion information comprises forwarding the second congestion information to a source node from which the identified at least one packet stream originates.
3. A method according to claim 1, wherein the step of generating a first congestion information at a local node further comprises detecting the emerging congestion by applying the set of rules.
4. A method according to claim 1, wherein the first congestion information comprising an identifier identifying a congested root is sent upstream, the at least one upstream node receiving the first congestion information stores in dependence on the set of rules the identifier in one of a first list indicating the packet stream causing the congestion and a second list indicating packet streams suspected of causing congestion.
5. A method according to claim 4, wherein the second congestion information comprising the identifier is sent further upstream, the at least one further upstream node receiving the second congestion information stores in dependence on the set of rules the identifier in one of the first list and the second list.
6. A method according to claim 4, wherein the identifier identifies multiple data flows.
7. A method according to claim 4, further comprising comparing based on the set of rules each identifier of incoming packet streams with the identifiers stored in the second list, wherein, if a condition of the set of rules with respect to one identifier holds, the one identifier is transferred from the second list to the first list.
8. A method according to claim 4, wherein the second congestion information with the identifier is sent upstream if the identifier is stored more than once in the first list.
9. A method according to claim 4, wherein a receiving source node at least reduces the sending of the packet stream identified by the identifier and sends a test packet to the local node where the congestion emerged, wherein entries of the received identifier in the second lists are removed along the way of the test packet to the local node where the congestion emerged.
10. A switching fabric for counteracting a saturation tree occurring in a network having nodes, comprising a local processing unit and a local memory adapted to generate at a local node where a congestion emerges a first congestion information; a local port controlled by the local processing unit for sending the first congestion information to at least one upstream node; a further processing unit and a further memory adapted to compare the content of the received first congestion information with a present local status based on a set of rules in order to identify at least one packet stream causing the congestion, and to generate a second congestion information comprising the identified at least one packet stream causing the congestion; and a further port for sending the second congestion information to at least one further upstream node.
11. A method according to claim 1, wherein the second congestion information comprises an expedite information.
12. The switching fabric according to claim 10, further comprising a first list for storing an identifier indicating a data stream causing the congestion, and a second list for storing the identifier indicating data streams suspected of causing congestion.
13. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing counteraction of a saturation tree occurring in a network having nodes, the computer readable program code means in said article of manufacture comprising computer readable program code means for causing a computer to effect the steps of: generating at a local node where a congestion emerges a first congestion information; sending the first congestion information to at least one upstream node; responsive to one received first congestion information comparing the content of the received first congestion information with a present local status based on a set of rules in order to identify at least one packet stream causing the congestion, and generating a second congestion information comprising the identified at least one packet stream causing the congestion; and sending the second congestion information to at least one further upstream node.
14. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for counteracting a saturation tree occurring in a network having nodes, said method steps comprising the steps of claim 1.
15. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for counteracting a saturation tree occurring in a network having nodes, said method steps comprising the steps of claim 2.
16. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for counteracting a saturation tree occurring in a network having nodes, said method steps comprising the steps of claim 3.
17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for counteracting a saturation tree occurring in a network having nodes, said method steps comprising the steps of claim 4.
18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for counteracting a saturation tree occurring in a network having nodes, said method steps comprising the steps of claim 5.
19. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of a switching fabric for counteracting a saturation tree occurring in a network having nodes, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 10.
20. A computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing functions of a switching fabric for counteracting a saturation tree occurring in a network having nodes, the computer readable program code means in said computer program product comprising computer readable program code means for causing a computer to effect the functions of claim 12.